Oils Is Exterior-First (Code, Text, and Structured Data)

2023-06-20

I introduced a distinction in Narrow Waists Can Be Interior or Exterior:

Interior means within a process, like PyObject*.
Exterior means outside or between processes, like Unix files.

This post uses the interior-exterior idea to describe Oils. For example, OSH and YSH are exterior-first, while PowerShell, Elvish, Nushell, and others are interior-first. To see this, we review three aspects of the design:

Units of code. YSH functions are interior, while "procs" are both interior and exterior.
Text. Oils uses UTF-8 strings both in memory and "on the wire". The concept of a Unicode encoding is an interior vs. exterior issue.
Structured data has two closely related forms:
- Interior: garbage-collected data structures in memory.
- Exterior: JSON-based data languages on the wire.

Recap

This is the third post in a series about YSH. I want it to be a "design roadmap" for contributors and for me, but I hope casual readers will also take something away.

Reviewing YSH - 7 language features, 3 arguments for structured data, ...
Sketches of YSH Features - 14 use cases for blocks, ...
Oils Is Exterior-First. This post returns to our #software-architecture ideas to explain the design from another perspective.

I "forked" another post while writing this one: How to Create a UTF-16 Surrogate Pair by Hand, with Python. It started with an implementation detail, and led to a good discussion about Unicode history, e.g. Windows vs. Unix.

Should Shells Have Two Tiers?

Now let's see how the distinction helps us with the design of YSH. Last year, The Sketch of the Biggest Idea in Software Architecture asked:

Should shells have two tiers? Both external processes and internal "functions"?

Both pipelines of bytes and pipelines of structured data?

We now have answers. YSH will have both:

Python-like functions, which are interior only, and
Shell-like procs which are interior or transparently exterior. (Details below.)

On the other hand, our pipelines are identical to those that Thompson's original shell pioneered: "real" OS processes communicating over channels created with pipe(). Structured data is layered on top, with textual data languages based on JSON and TSV.

Survey of alternative shells

This table summarizes my impressions of a few alternative shells (corrections are welcome):

Style	Shell	VM / Scheduler	What's in a pipeline?	What's Piped?
Interior	PowerShell	.NET VM	cmdlet, a kind of class	.NET objects, instances of classes
Interior	Elvish	Goroutine scheduler	Function, Wrapped Process	Garbage-collected Go records, or JSON
Interior	Nushell	Rust async I/O scheduler	Builtins or Plugins	Rust/serde Objects, JSON/msgpack
Exterior	Oils	Unix Kernel	procs or processes	Bytes ⊃ Text ⊃ Data Languages

So it appears that most alternative shells are interior-first, but Oils is exterior-first.

The distinction isn't black and white: All shells have both facilities (even bash), so it's more a matter of what's "primary" in the design. It's also a matter of how awkward the interface is — do you have two different "worlds" or tiers to bridge?

Nonetheless, we'll say an interior-first shell favors code that lives within a process, while an exterior-first shell favors coordinating data between processes.

Exterior designs are layered

Notice the layering:

Bourne shell starts with bytes from the kernel, and layers conventions on top.
One convention is the line-based structure that grep, awk, and sed use, which requires understanding ASCII '\n'.
UTF-8 text is layered on top of ASCII, in a compatible way.
Oils layers structured data languages on top of UTF-8 text.
- I introduced the name for these data languages in the last post: J8 Notation. It consists of J8 Strings, records (JSON8), and tables (TSV8).

I would draw this as:

Bytes ⊃ ASCII ⊃ UTF-8 ⊃ Data Languages
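
To make the layering concrete, here is a minimal Python sketch. It is only an illustration, not Oils code: the `classify` helper and the sample input are made up, and plain JSON stands in for the J8 data languages. It shows that splitting on the ASCII newline works on raw bytes, and that each higher layer is an optional interpretation of the same bytes.

```python
import json

def classify(line: bytes) -> str:
    """Return the highest layer that a line of bytes fits into (hypothetical helper)."""
    try:
        text = line.decode("utf-8")   # layer: UTF-8 text, which includes ASCII
    except UnicodeDecodeError:
        return "bytes"                # fine for the kernel, but not text
    try:
        json.loads(text)              # layer: a data language on top of text
    except json.JSONDecodeError:
        return "text"
    return "data"

# Three lines as a pipe might deliver them. The middle one is not valid UTF-8.
raw = b'{"name": "caf\xc3\xa9"}\n\xff\xfe binary\nplain text\n'

# Splitting on the newline byte needs no decoding, because 0x0A never
# appears inside a multi-byte UTF-8 sequence.
lines = raw.split(b"\n")[:-1]
layers = [classify(line) for line in lines]
print(layers)
```

A tool like grep only needs the lowest two layers, which is why it keeps working on data it doesn't fully understand.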

I'd also say the exterior style is one level below interior shells, which preserves shell's role as universal glue. If you want to glue together a .NET VM and a Go process, or a Clojure program and an R script, your lowest common denominator is probably a pipe, socket, or bash script.

YSH whipupitude

So does YSH have two tiers? Despite having both proc and func, I'm trying to avoid two tiers, at least to the extent that it reduces the whipupitude of shell.

It's "wrong" to think about YSH programs in this Python-like way:

First you manipulate garbage-collected data structures in memory
Then you serialize them to some format.

The "right way" is to program directly with text, including our data languages for strings, records and tables. They are designed to eliminate ad-hoc parsing, which is the main downside of text.

Our in-memory data structures map one-to-one with text, and are in service of text. The encode() and decode() operations on J8 strings are perfect inverses, for arbitrary byte strings.

More details below.

Code

Procs are Interior or Exterior

Why did I say procs are either interior or transparently exterior? Because that's how Bourne shell works, and it's powerful and underused. The simplest usage of a proc occurs in a single process, making it interior:

myproc() {
  cp *.py /tmp
  echo done
}
myproc  # interior call

But you have at least two ways of making procs exterior:

Shell Has a Forth-like Quality
- sudo $0 myproc "$@" is the $0 Dispatch Pattern. This is how you run a shell function with different privileges. The proc itself becomes a child process of sudo.
Pipelines Support Vectorized, Point-Free, and Imperative Style
- myproc | wc -l transparently "remotes" myproc into another process, via fork()
- Related nice idea: Teleforking a Process onto a Different Computer (thume.ca)

(Implementation status: procs exist in YSH, but we still need to implement functions.)

FAQ: `proc main` versus `func main`

This is a good time to answer a great question from Mastodon. I expect it to be common, so I'll paraphrase:

Now that YSH has functions, can we just ignore procs? Start with func main, and call other functions with typed data?

I don't want to dictate the way people write code, but I think there are some downsides:

You'd lose the 2 styles of composition I mentioned above: Forth-like words, and pipelines.
- That is, procs are shaped like external processes.
- Shell's "standard library" like coreutils lives in external processes.
- Your own tools in Python, Rust, Elixir, etc. live in external processes.
When units of code are the same shape, they compose, and there's less room for bugs. Shell is a language that grows.
The exterior usage of procs is useful for distributed computing, including building containers. That is, I write shell scripts on one machine, and distribute them to another machine, or to an isolated container. This is analogous to what Docker does implicitly with "contexts".
procs are easy to use and test interactively, e.g. with the "task file" idiom, based on procs.
- I test my shell programs multiple times a minute when developing.

Textual flags vs. typed arguments?

Procs with flags can and should have a stable exterior interface! On the other hand, funcs are interior, and may break during refactoring.

Flags often follow Hickey's description of versionless evolution: Strengthen a Promise, Relax a Requirement. Adding a flag is a compatible change, while changing the type of an argument is a breaking change.

Literal command line flags are used to evolve the largest distributed systems at "hyperscalers" like Google. This is because distributed systems can't be upgraded atomically. The site featureflags.io seems to elaborate on this idea.
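
The point about flags evolving compatibly can be sketched with Python's argparse. This is only an illustration under my own assumptions: the `deploy` tool and its `--host` and `--dry-run` flags are hypothetical, and a YSH proc would declare its flags differently.

```python
import argparse

def make_parser_v1() -> argparse.ArgumentParser:
    """Version 1 of a hypothetical tool's exterior interface."""
    p = argparse.ArgumentParser(prog="deploy")
    p.add_argument("--host", default="localhost")
    return p

def make_parser_v2() -> argparse.ArgumentParser:
    """Version 2 adds an optional flag with a default (relaxing a requirement)."""
    p = make_parser_v1()
    p.add_argument("--dry-run", action="store_true")
    return p

old_argv = ["--host", "example.com"]   # a caller written against v1
new_argv = old_argv + ["--dry-run"]    # a caller written against v2

# The old caller keeps working, unchanged, against the new version.
v1 = make_parser_v1().parse_args(old_argv)
v2 = make_parser_v2().parse_args(old_argv)
print(v1.host, v2.host, v2.dry_run)
```

Callers and callees can then be upgraded in either order, as long as new callers only pass `--dry-run` to versions that know about it. Changing the type of a positional, typed argument offers no such path.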

The advantages of procs will probably become clearer when actually writing code. I should write more #shell-the-good-parts posts with concrete examples, but until then you can see them all over the Oil repo.

Text

Now that we've discussed interior and exterior code, let's discuss text. It's central to not just shell, but all programming languages.

Survey: Programming languages disagree

Text is also complex and controversial. This article, linked in the appendix of the surrogate pair post, shows that languages disagree on the length operation:

Programming Languages	Length of 🤦🏼‍♂️
Go, Rust, Python 2	17 UTF-8 code units, aka bytes
JavaScript, Java	7 UTF-16 code units
Python 3, bash	5 UTF-32 code units, aka code points
Swift	1 Extended grapheme cluster, which doesn't have a fixed definition
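
The first three rows can be checked from Python 3 alone. This is a small sketch: the emoji is spelled out as escapes so the code points are visible, and the grapheme cluster row is left out because it needs a third-party library.

```python
# "Man facepalming, medium-light skin tone", written as its 5 code points:
# face palm + skin tone modifier + zero width joiner + male sign + variation selector
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

utf8_units = len(facepalm.encode("utf-8"))             # bytes
utf16_units = len(facepalm.encode("utf-16-le")) // 2   # 2 bytes per code unit, no BOM
code_points = len(facepalm)                            # what Python 3's len() counts

print(utf8_units, utf16_units, code_points)  # 17 7 5
```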

The surrogate pair post also sketches the history of this divergence, which is basically a Unix vs. Windows problem. Languages tend to follow operating systems, so JavaScript, Python, and JSON were dragged along for the 30-year ride.

UTF-8 is for the Interior, not just the Exterior

The length issue correlates with — but isn't identical to — another controversial issue: the representation of strings in memory. That is, the interior representation.

Oils follows the Go language: a string is an array of bytes, which may or may not be valid UTF-8:

Strings, bytes, runes and characters in Go (Go blog, 2013)

Contrary to popular belief, and contrary to Python, C, and C++, UTF-8 is a great interior representation. It's naturally compressed in memory, and you can search for ASCII substrings like { or // within it, without decoding.
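
Here is a minimal sketch of the search claim, using Python `bytes` as a stand-in for a byte-array string. The sample line is made up.

```python
# UTF-8 bytes for a line with non-ASCII text before a comment marker.
line = "naïve = 'café'  // commentaire".encode("utf-8")

# Search the raw bytes. No decoding step is involved.
byte_index = line.find(b"//")

# The match can't be a false positive: every byte of a multi-byte UTF-8
# sequence has its high bit set, so it never equals an ASCII byte like '/'.
assert all(b >= 0x80 for b in "ïé".encode("utf-8"))

comment = line[byte_index:]
print(byte_index, comment.decode("utf-8"))
```

Note that the result is a byte offset, not a code point offset, which is exactly what you want for slicing the same byte string afterward.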

At some point, I may write Four Reasons New Programming Languages Should Adopt a UTF-8 Centric Design:

New languages use UTF-8 internally: Go, Julia, Rust, Swift, Elixir, ...
Older languages are moving toward UTF-8: Python 3, Ruby, Java
Windows is taking steps toward UTF-8, starting in 2019

Those 3 reasons should be enough. If not, PyPy showed us in 2019 how to use UTF-8 internally, while still retaining O(1) random code point access. You probably don't need this operation, but if you really do, it can be made both time- and space-efficient.

Important: even though Oils is UTF-8 centric, it works with languages that use any string representation. The post above would explain why we're diverging from bash.

I would also mention a bug I found in 2018: bash's ${#s}, which measures length in code points, is a non-monotonic function of bytes. That is, adding a byte on the end of a string can reduce its length! This happens because bash doesn't handle invalid UTF-8 properly.

Possible APIs for YSH

Still, I recognize there is a tremendous amount of confusion around strings and UTF-8. We could make our APIs more explicit:

Instead of len(s), we could have

s->numBytes()     # O(1)
s->countRunes()   # O(n), may raise decode error

Decoding:

$ var runes = s->toRunes()
$ write (runes)
[65, 20, 66, ... ]

$ var s2 = Str.fromRunes(runes)  # not a method?

Indexing:

s->byteAt(i)      # O(1)
s->findCharAt(i)  # NOT useful, use toRunes() instead

Or maybe indexing should be s[i] because there's only one O(1) operation. Same question with slicing:

s[i:j]              # O(1)
s->byteSlice(i, j)  # O(1), is this better?

Iteration:

for byte in (s) {  # Go iterates over runes, not bytes
  write (byte)
}

for rune in (s->toRunes()) {
  write (rune)
}

Substring search:

var i = s->find('//')  # remember this works without decoding

Regex:

# replace a byte or rune?
var result = s->replace( / <dot> %end /, ^"$1" )

This is just an idea. Right now we have len(s) giving the number of bytes.
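
To show the semantics I have in mind for the explicit API, here is a rough Python model over `bytes`. The function names mirror the YSH sketches above but are otherwise hypothetical, and none of this is the Oils implementation.

```python
from typing import List

def num_bytes(s: bytes) -> int:
    """O(1): the length in bytes."""
    return len(s)

def to_runes(s: bytes) -> List[int]:
    """O(n): decode UTF-8 into code points. Raises UnicodeDecodeError on bad input."""
    return [ord(ch) for ch in s.decode("utf-8")]

def count_runes(s: bytes) -> int:
    """O(n), and may raise a decode error, like to_runes()."""
    return len(to_runes(s))

def from_runes(runes: List[int]) -> bytes:
    """Inverse of to_runes() for valid code points."""
    return "".join(chr(r) for r in runes).encode("utf-8")

s = "A\u00e9B".encode("utf-8")   # 'A', 'é', 'B'
print(num_bytes(s))              # 4, because 'é' takes 2 bytes
print(to_runes(s))               # [65, 233, 66]
print(s[1])                      # 195: indexing gives a byte, in O(1)

# Appending an invalid byte makes the rune operations fail loudly,
# while the byte operations keep working.
bad = s + b"\xff"
try:
    count_runes(bad)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(num_bytes(bad), decode_failed)
```

The design choice this illustrates: byte operations are total and cheap, while rune operations are linear and can fail, so it may be worth making that difference visible in the names.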

Either way, the point is that strings in Oils follow exterior reality. They're arbitrary byte strings that may or may not be UTF-8 encoded. In contrast, bash strings are NUL-terminated, but they also don't have to be valid Unicode. UTF-8 is not present in the Unix kernel — it's layered on top.

Structured Data

Now that we've talked about text, let's talk about structured data. Remember that it's layered on top of text, and that it's a big YSH feature:

Shell Should Be More Like Python, JavaScript, and Ruby

Data structures follow Data languages, not vice versa

This is another way of saying that our data model is designed to be serialized, rather than serialization being an afterthought. In the intro, I said that:

Garbage-collected data structures in memory are interior. They're also ephemeral, i.e. lost when the process stops.
Data languages like JSON, TSV, and our J8 Notation extensions are exterior. They may persist on storage devices.

What interior data structures will YSH have? To follow our exterior languages, I've decided on the following data model:

Null, Bool, Int, Float, List, Dict, ...
- (... code objects like Eggex, Expr, which are left out of this discussion)

This can be described as either:

JSON plus the Int-Float distinction (not just "number" like JavaScript)
Python minus the Unicode-Bytes and List-Tuple distinctions

The idea is that interior structures and exterior languages map one-to-one, to the degree possible. I think having both ints and floats is important, because both JavaScript and Lua originally had a single number type, and grew proper integers after real usage.
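
Python's json module happens to behave close to this data model, so it makes a convenient sketch. It is only a stand-in: YSH's types and its JSON library are separate from Python's.

```python
import json

# Each JSON value maps to exactly one interior type.
values = json.loads('[null, true, 42, 42.0, "s", [1, 2], {"k": "v"}]')
types = [type(v).__name__ for v in values]
print(types)  # ['NoneType', 'bool', 'int', 'float', 'str', 'list', 'dict']

# Int and Float stay distinct across a round trip ("JSON plus Int-Float").
assert json.dumps([42, 42.0]) == "[42, 42.0]"

# A tuple does not survive: the exterior language has only one sequence type
# ("Python minus List-Tuple").
assert json.loads(json.dumps((1, 2))) == [1, 2]
```

Leaving tuples out of the interior model means there is nothing to lose in that last step.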

So by choosing data structures to be in service of data languages, YSH is exterior-first.

The JSON-Unix string mismatch

But there are more one-to-one mapping problems. Here's the biggest one: JSON strings don't correspond to Unix strings.

This is a fairly technical issue, so I "forked" another post from this one:

How to Create a UTF-16 Surrogate Pair by Hand, with Python.
- JSON syntax is ignorant of surrogates. Errors can travel over the wire, between processes!
- This is a major reason JSON strings correspond to neither valid Unicode strings nor arbitrary bytes.

I mentioned another demo in that post:

Can the Filename \xff Be JSON-Piped Between Python and JavaScript? This would demonstrate more of the JSON-Unix String Mismatch.

See blog-code/j8-notation if you want a preview.
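
Both directions of the mismatch can be seen from Python's standard json module. This is a small sketch, separate from the demo mentioned above.

```python
import json

# 1. JSON syntax accepts a lone surrogate escape, so a decoded JSON string
#    may not be valid Unicode, and then it can't be written out as UTF-8.
s = json.loads('"\\ud83e"')
try:
    s.encode("utf-8")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False

# 2. A Unix filename can contain the byte 0xff, which is not valid UTF-8.
#    It can't be decoded to a string, and json.dumps() rejects bytes outright.
filename = b"\xff"
try:
    filename.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False
try:
    json.dumps(filename)
    bytes_ok = True
except TypeError:
    bytes_ok = False

print(lone_surrogate_ok, decodes, bytes_ok)  # False False False
```

So some JSON strings aren't Unix strings (case 1), and some Unix strings aren't JSON strings (case 2).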

How do we fix this? As mentioned in Sketches of YSH Features, we're adding \yff and \u{123456} to JSON strings, and calling those "J8 strings". This is the basis of "J8 Notation".

Mathematically:

j8encode() should be a total function over bytes, which is the set of values the Unix kernel gives you.
j8encode() and j8decode() should be a pair of bijective functions, i.e. inverses of each other.

Practically speaking, these properties make it easier to write correct shell programs. You can use J8 strings instead of ad-hoc parsing with spaces, newlines, or commas.
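
Here is a toy Python sketch of those two properties. It is not the J8 specification: it only borrows the \yff escape idea, the `toy_j8_encode` and `toy_j8_decode` names are hypothetical, and it leans on Python's `surrogateescape` error handler to carry invalid bytes through the decoder.

```python
def toy_j8_encode(b: bytes) -> str:
    """Total over bytes: every byte string gets a printable, quoted encoding."""
    # surrogateescape maps each undecodable byte 0xNN to the code point 0xDC00 + 0xNN
    s = b.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in s:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:       # a byte that was not valid UTF-8
            out.append("\\y%02x" % (cp - 0xDC00))
        elif cp < 0x20:                  # control characters, e.g. newline
            out.append("\\y%02x" % cp)
        elif ch in ('\\', '"'):
            out.append("\\" + ch)
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'

def toy_j8_decode(s: str) -> bytes:
    """Inverse of toy_j8_encode(). Assumes its input came from that function."""
    body = s[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if body[i + 1] == "y":       # \yXX is one raw byte
                out.append(int(body[i + 2:i + 4], 16))
                i += 4
            else:                        # \\ or \"
                out += body[i + 1].encode("utf-8")
                i += 2
        else:
            out += body[i].encode("utf-8")
            i += 1
    return bytes(out)

samples = [bytes([i]) for i in range(256)]
samples += [b"", "caf\u00e9".encode("utf-8"), b"a b\nc,d", b'\\y41 "q"', b"\xed\xb2\x80"]
round_trips = all(toy_j8_decode(toy_j8_encode(b)) == b for b in samples)
print(toy_j8_encode(b"file \xff.txt"), round_trips)
```

Because spaces stay literal and newlines are escaped, each encoded value fits on one line, which is what lets a program skip ad-hoc splitting.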

Personal Stories / History

We reviewed code, text, and structured data, which showed how YSH favors the exterior viewpoint.

This is because it's meant to compose seamlessly with processes not written in shell!

In other words, Oils is not a closed world. It's part of an operating system, and part of distributed systems. Again, shell is a language that grows: Unix Shell: Philosophy, Design, and FAQs.

The First JSON language I designed (2009)

My "JSON Template" project from 2009 is relevant to the exterior-first philosophy. It's a string templating language that puts serialized data first — hence the name.

It no longer has an official repo because it was hosted on Google Code, which is now defunct. Ironically, I wrote it when I worked on Google Code itself!

(It lives on in the Oil repo as test/jsontemplate.py. It's been part of the "Wild" test report for years, and I recently ported our Soil CI dashboard to it.)

Why did I create JSON Template? Google Code was written in Python and JavaScript, and I didn't like using 2 template languages: one on the server, and another on the client. (Remember how different the ecosystem was prior to 2009: node.js didn't yet exist.)

So I designed a data-driven template language, wrote an interpreter for it in Python, and ported the interpreter line-for-line to JavaScript.

For a ~1200 line program, it was surprisingly influential! It was the "version 1" of Go's text/template:

Here's Rob Pike's 2009 implementation in the Go repo, before Go was open-sourced.
Data Structures Go Programs by Russ Cox (2009) mentions this JSON Template implementation. It apparently influenced Go's reflective API — the way you transform statically typed Go structs to dynamic JSON.

We were technically co-workers, but Rob and Russ actually just found the project on Reddit. It was exciting to get this validation from much more experienced engineers!

After Go 1.0, text/template was redesigned in a more imperative style. The JSON Template influence is still present in:

{{.}}               # "dot"
{{with X}} {{end}}  # push a scope, and conditionally execute

Those correspond to the primitives of JSON Template:

{@}                  # the "cursor"
{.section X} {.end}  # conditionally expand in a JSON namespace

Squarespace also started using it in 2010. I met the founder Anthony when they were a small company with a new office in Manhattan.

I thought they had moved off it, but I found this pretty recent YouTube video, which shows that it's still part of the Squarespace platform? I'd be interested if anyone understands how exactly it's used.

Coding Squarespace Templates with JSON T - You can see the {.section} {.end} syntax.

I bring this up to show that it's useful to think about serialized text first, in the Unix style. I don't think of JSON Template as a language for Python, JavaScript, or Go. It stands alone — floating in the cloud — and that requires using a language-independent representation like JSON.

My office-mate looks at JSON (2006)

I might as well drop another JSON story here: I introduced JSON to Python creator Guido van Rossum in 2006, although I'm not sure it led to anything consequential. JSON was added to the Python library a few years later via the library simplejson, which would have happened anyway.

Another one of my defunct Google Code projects was "chutils", which had a program called dice. It was basically JSON Lines or ndjson in 2006: a set of Unix utilities that communicated with JSON over pipes.

I used it to analyze logs from Google's internal dev tools. In particular, the "hist" operator avoided the ad-hoc parsing of sort | uniq -c | sort -n:

$ cat x.tar.gz | to-json-lines | hist cmd  # histogram by field name
 905 log
 405 commit
  89 rm
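
The original tool is long gone, so here is a hypothetical reconstruction of the idea in Python, not its actual code. The `hist` function and the sample records are assumptions. It counts records by a named field in JSON Lines input, which is the job that sort | uniq -c | sort -n does by parsing text ad hoc.

```python
import json
from collections import Counter

def hist(lines, field):
    """Count JSON Lines records by the value of one field, most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)    # one JSON object per line
        counts[record.get(field)] += 1
    return counts.most_common()

sample = """\
{"cmd": "log", "user": "a"}
{"cmd": "commit", "user": "b"}
{"cmd": "log", "user": "b"}
{"cmd": "rm", "user": "a"}
{"cmd": "log", "user": "c"}
{"cmd": "commit", "user": "a"}
""".splitlines()

for value, count in hist(sample, "cmd"):
    print("%4d %s" % (count, value))
```

Because the field is looked up by name, values containing spaces or tabs can't break the count.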

Guido was my office-mate at the time, and I remember he was pleasantly surprised by this "cool" use of Unix. I then showed him https://json.org/, and he said, "That's just Python!"

I believe that's almost literally true: all JSON will successfully eval() in the Python interpreter, as long as you define null, true, false = None, True, False.
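
A quick sketch of both the claim and the "almost", with a made-up sample document. Don't eval() untrusted input, of course. This is only to show how close the two syntaxes are.

```python
import json

doc = '{"name": "x", "tags": [1, 2.5, null], "ok": true, "err": false}'
names = {"null": None, "true": True, "false": False}

via_eval = eval(doc, names)   # parse the text as a Python expression
via_json = json.loads(doc)    # parse the same text as JSON
print(via_eval == via_json)   # True

# One place the two differ: an escaped surrogate pair. JSON combines the
# pair into a single code point, but a Python string literal does not.
esc = '"\\ud83e\\udd26"'
print(len(json.loads(esc)), len(eval(esc)))   # 1 2
```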

Remember, this was 2006, and JSON was quietly invented in 2001. GMail and Google Maps popularized "AJAX" in 2004 and 2005, which was nominally based on XML. Server-side JavaScript didn't exist (or it was a failed Netscape experiment most people were unaware of).

(History question: Why do Python and JavaScript share nearly the same syntax for {} and [] container literals? Python appeared in 1990, and JavaScript in 1995. Did they have a common ancestor, or did Python influence JavaScript? I recall that Guido said Python didn't invent this syntax, but I don't know where it came from. C doesn't have it.)

Oils and JSON

So despite playing with JSON for ~15 years, why is Oils coming around to it now? Well, Oils has had rough JSON support since 2019:

(Hmm, re-reading this thread is interesting, I may comment on the issue of objects vs. data later — it's interior vs. exterior !!)

JSON-based data languages are becoming more central though. I'd say the main issues are:

JSON libraries are tightly coupled to the language's data structures, which are tightly coupled to the garbage collector. And our GC didn't work until early this year.
We still have to divorce YSH from CPython, by changing PyObject* to our ASDL value_t. This is all over the codebase. I mentioned this in the 2023 Roadmap, and it's a big deal.
After that we have to write our own JSON library on top of value_t! For some reason I had put "writing our own JSON library" out of scope for Oils, but now I think it's in scope.
Based on experience with dice and JSON Template, JSON isn't always appropriate for command line tools. Some things are more naturally modeled as tables, as I mentioned in What Is a Data Frame? (2018).
I only recently figured out a good design for tables, called TSV8, which I mentioned in Sketches of YSH Features.

JSON has a few other weaknesses besides the JSON-Unix String Mismatch:

Transmitting floats exactly. (I'm not sure it's a problem for shell-like problems, e.g. awk has a fixed 6 digits of precision!)
Serializing graphs rather than trees. We have a separate solution for this problem, and binary data, which I think will be good.

Conclusion

To summarize:

I introduced the interior-exterior distinction: within a process, or between processes.
It applies to narrow waists: PyObject* is an interior waist, while Unix files are an exterior waist.

It also applies to shell design: Oils is exterior-first.

Code: YSH has interior funcs, but exterior procs are useful and idiomatic.
Text: UTF-8 is an efficient interior representation, not just an exterior one.
Data: Exterior data languages come first, then interior data structures.
- But we must smoothly bridge them, which leads to exterior "J8 strings".
- J8 strings fix the "JSON-Unix String Mismatch".
- JSON8 and TSV8 will be built on top of J8 strings.

YSH is Python-influenced, but still a shell

Now, what was the point of introducing PyObject*? It was an example of an interior narrow waist, but how does that relate to shell?

It explains the design: We won't have extensible data types like Python does! YSH is not for writing vector and matrix libraries :-)

In other words, the narrow waist of Oils is still exterior Unix files, not interior like PyObject*.

It's a Python-like language, but it's still a shell. You program "directly" with text, which is now structured.

Let's fix shell

Let's end this post with another question: do these abstract ideas matter?

I think they will be our north star for a clean, focused, and bounded language design. Even though shell is a popular, fast-growing language in 2023, I frequently see comments like these, with many upvotes:

We really gotta stop writing and using software written in shell. There are so many footguns in shell that these types of mistakes are inevitable.

This tells me that the shell language has become so complex that many users have given up hope of ever writing it correctly. They don't even want to start learning it.

For this case, the problem was the difference between "$@" and eval "$@", which I mentioned in this issue.

But even that's confusing: the four characters "$@" look similar to "$x", but have wildly different semantics. And eval implicitly joins its arguments, which is even more confusing.

We have an opportunity to fix this with YSH. We're deep in the middle of it, with a lot left to do. But writing this series of posts has greatly clarified its design. We have a decision for essentially all design issues, although we'll certainly revise the language as we implement it.

I think we can produce something great!

Let me know what you think in the comments, which are now on Zulip.